It is important to know the encodings of the files you are working with. This is probably one of the largest difficulties faced by people working with Chinese language texts.
Chinese-language digital texts come in a variety of encodings:
UTF-8: This is the easiest to work with. UTF stands for Unicode Transformation Format; it is an international character set that encodes texts in all languages. UTF-8 extends the ASCII character set and is the most common encoding on the internet.
GB 2312: This was the official character set of the People's Republic of China. It is a simplified character set.
GBK: GBK extends GB 2312, adding missing characters.
GB 18030: GB 18030 was designed to replace GB 2312 and GBK in 2005. Because it is a Unicode encoding it also covers traditional characters, while maintaining compatibility with GB 2312.
Big5: Big5 is a traditional Chinese encoding that is common in Taiwan and Hong Kong.
Many websites and text files containing Chinese text still use GB 2312.
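If you receive a file in one of the GB encodings, you can convert it to UTF-8 with Python's built-in codecs. The sketch below assumes a hypothetical file called gb_example.txt; it decodes with the gb18030 codec (a superset of GB 2312 and GBK) and writes the text back out as UTF-8.

# A minimal sketch, assuming a hypothetical GB-encoded file named gb_example.txt.
# gb18030 is a superset of GB 2312 and GBK, so it can decode files in any of them.
with open("gb_example.txt", "r", encoding="gb18030") as infile:
    text = infile.read()
# Write the same text back out as UTF-8.
with open("gb_example_utf8.txt", "w", encoding="utf-8") as outfile:
    outfile.write(text)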
If we want to perform Stylometric analysis and compare a variety of texts, it is easiest to ensure that they are all in the same folder. This will allow us to write code that cleans the text and performs the analysis quickly and easily.
The folder of texts will need to be in the same directory as this Jupyter notebook. I have provided a collection of files to analyze as part of the workshop. Name the folder something sensible; I have chosen to call the included folder "corpus."
If we want to keep track of information about our texts, we need to decide on a way to do this. I prefer to include a metadata file in the same folder as my Python script that describes each text. Each text's file name serves as an ID, and we will use that ID to look up information about the text.
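To make the layout concrete, here is a small sketch of what such a tab-separated metadata file could look like. The IDs, titles, and other values below are hypothetical placeholders, not the actual workshop files; each line holds a file ID, title, author, era, and genre separated by tabs.

# Hypothetical example lines for a tab-separated metadata file.
example_lines = [
    "text01\t紅樓夢\t曹雪芹\t清\t小说",
    "text02\t水滸傳\t施耐庵\t明\t小说",
]
for line in example_lines:
    # splitting on tabs recovers the individual columns
    print(line.split("\t"))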
In [1]:
%pylab inline
pylab.rcParams['figure.figsize']=(12,8)
In [2]:
my_file = open("test.txt", "r")
file_contents = my_file.read()
print(file_contents)
open() takes several parameters: first the file path, then the open mode. "r" means read, "w" means write, and "a" means append.
If you open a file in write mode and the file already exists, the program will wipe the file's contents without warning you. If it doesn't exist, the file will be created automatically.
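If you are worried about accidentally overwriting a file, Python also supports "x" (exclusive creation) mode, which raises an error instead of silently truncating an existing file. A small sketch, using a hypothetical file name:

# "x" mode creates the file only if it does not already exist.
try:
    f = open("new_results.txt", "x", encoding="utf-8")
    f.write("some output\n")
    f.close()
except FileExistsError:
    print("new_results.txt already exists; refusing to overwrite it.")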
In [3]:
my_file = open("test.txt", "r", encoding="utf-8")
file_contents = my_file.read()
print(file_contents)
When you open many files at once, you will sometimes run into encoding errors no matter what you do. You have several options: you can delete the offending characters or replace them with a replacement character. The corpus I've provided doesn't have any of these issues, but as you adapt this code to run in the wild, you may run into some.
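The errors="replace" option used in the next cell substitutes a replacement character for each undecodable byte; errors="ignore" simply drops them. A sketch of the "ignore" variant on the same test.txt:

# Drop any bytes that cannot be decoded as UTF-8 instead of replacing them.
my_file = open("test.txt", "r", encoding="utf-8", errors="ignore")
file_contents = my_file.read()
my_file.close()
print(file_contents)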
In [4]:
my_file = open("test.txt", "r", encoding="utf-8", errors="replace")
file_contents = my_file.read()
print(file_contents)
In [5]:
import os
for root, dirs, files in os.walk("corpus"):
for filename in files:
# I do not want to open hidden files
if filename[0] != ".":
# open the file
f = open(root + "/" + filename, "r", encoding = "utf8")
# read the contents to a variable
c = f.read()
# make sure to close the file when you are done
f.close()
# check to see if your code is working
# here I am just printing the length of the string
# printing the string would take up a lot of room.
print(len(c))
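As an aside, a with block closes the file automatically, even if an error occurs while reading, so you cannot forget the f.close() call. This is an alternative sketch of the same loop, not the version used in the rest of the workshop:

import os

for root, dirs, files in os.walk("corpus"):
    for filename in files:
        if filename[0] != ".":
            # the with block closes the file as soon as the block ends
            with open(os.path.join(root, filename), "r", encoding="utf8") as f:
                c = f.read()
            print(len(c))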
info_list = []
for root, dirs, files in os.walk("corpus"):
    for filename in files:
        if filename[0] != ".":
            f = open(root + "/" + filename, "r", encoding="utf8")
            c = f.read()
            f.close()
            info_list.append(c)
for c in info_list:
    print(len(c))
The texts must first be cleaned before we can do anything with them. The best way to do this is to write a function that performs the cleaning; we will then call it whenever we need a text cleaned. We will remove most unwanted characters with regular expressions. As a few characters will not match the regex, we will use a loop to remove the rest.
In [6]:
import re
def clean(instring):
# Remove mid-file markers
instring = re.sub(r'~~~START\|.+?\|START~~~', "", instring)
# This regex will remove all letters and numbers
instring = re.sub(r'[a-zA-Z0-9]', "", instring)
# A variety of characters to remove
unwanted_chars = ['』','。', '!', ',', ':', '、', '(',
')', ';', '?', '〉', '〈', '」', '「',
'『', '“', '”', '!', '"', '#', '$', '%',
'&', "'", '(', ')', '*', '+', ',', '-',
'.', '/', "《", "》", "·", "a", "b"]
for char in unwanted_chars:
# replace each character with nothing
instring = instring.replace(char, "")
# return the resulting string.
return instring
In [7]:
info_list = []
# just for demonstration purposes
not_cleaned = []
for root, dirs, files in os.walk("corpus"):
for filename in files:
if filename[0] != ".":
f = open(root + "/" + filename, "r", encoding="utf8")
c = f.read()
f.close()
not_cleaned.append(c)
info_list.append(clean(c))
print("This is before:" + not_cleaned[0][:30])
print("This is after: " + info_list[0][:30])
In [8]:
info_list = []
# just for demonstration purposes
not_cleaned = []
for root, dirs, files in os.walk("corpus"):
for filename in files:
if filename[0] != ".":
f = open(root + "/" + filename, "r", encoding="utf8")
c = f.read()
f.close()
not_cleaned.append(c)
# remove white space
c = re.sub(r"\s+", "", c)
info_list.append(clean(c))
print("This is before:" + not_cleaned[0][:30])
print("This is after: " + info_list[0][:30])
Now that we have a clean string to analyze, we will want to decide how to analyze it. The first step is to decide if we want to look at the entire text, or break it apart into equal lengths. There are advantages and disadvantages to each. I will show you how to break apart the texts. To not break the text apart, simply change break_apart to False.
In [9]:
# This function does not retain the leftover small section at
# the end of the text
def textBreak(inputstring):
# Decide how long each section should be
divlim = 10000
# Calculate how many loops to run
loops = len(inputstring)//divlim
# Make an empty list to save the results
save = []
# Save chunks of equal length
for i in range(0, loops):
save.append(inputstring[i * divlim: (i + 1) * divlim])
return save
break_apart = True
if break_apart == True:
broken_chunks = []
for item in info_list:
broken_chunks.extend(textBreak(item))
# Check to see if it worked.
print(len(broken_chunks[0]))
In [10]:
# Create a dictionary to store the information
metadata = {}
# open and extract the string
metadatafile = open("metadata.txt", "r", encoding="utf8")
metadatastring = metadatafile.read()
metadatafile.close()
# split the string into lines
lines = metadatastring.split("\n")
for line in lines:
# split using tabs
cells = line.split("\t")
# use the first column (the file ID) as the key, and store
# the rest of the columns as the value
metadata[cells[0]] = cells[1:]
print(metadata)
In [11]:
# Create empty lists to store the information
info_list = []
title_list = []
author_list = []
era_list = []
genre_list = []
# Create dictionaries to store unique info:
title_author = {}
title_era = {}
title_genre = {}
for root, dirs, files in os.walk("corpus"):
for filename in files:
if filename[0] != ".":
f = open(root + "/" + filename, "r", encoding="utf8")
c = f.read()
f.close()
c = re.sub(r"\s+", "", c)
c = clean(c)
# Get metadata. the [:-4] removes the .txt from filename
metainfo = metadata[filename[:-4]]
info_list.append(c)
title_list.append(metainfo[0])
author_list.append(metainfo[1])
era_list.append(metainfo[2])
genre_list.append(metainfo[3])
title_author[metainfo[0]] = metainfo[1]
title_era[metainfo[0]] = metainfo[2]
title_genre[metainfo[0]] = metainfo[3]
print(title_list)
In [12]:
# Create empty lists/dictionaries to store the information
info_list = []
title_list = []
author_list = []
era_list = []
genre_list = []
title_author = {}
title_era = {}
title_genre = {}
# We should also keep track of the section number
section_number = []
for root, dirs, files in os.walk("corpus"):
for filename in files:
if filename[0] != ".":
f = open(root + "/" + filename, "r", encoding="utf8")
c = f.read()
f.close()
c = re.sub(r"\s+", "", c)
c = clean(c)
# Get metadata. the [:-4] removes the .txt from filename
metainfo = metadata[filename[:-4]]
# The dictionary formation stays the same
title_author[metainfo[0]] = metainfo[1]
title_era[metainfo[0]] = metainfo[2]
title_genre[metainfo[0]] = metainfo[3]
# Break the Text apart
broken_sections = textBreak(c)
# We will need to extend, rather than append
info_list.extend(broken_sections)
title_list.extend([metainfo[0] for i in
range(0,len(broken_sections))])
author_list.extend([metainfo[1] for i in
range(0,len(broken_sections))])
era_list.extend([metainfo[2] for i in
range(0,len(broken_sections))])
genre_list.extend([metainfo[3] for i in
range(0,len(broken_sections))])
section_number.extend([i for i in range(0, len(broken_sections))])
print(author_list[:20])
In [13]:
# Create empty lists/dictionaries to store the information
info_list = []
title_list = []
author_list = []
era_list = []
genre_list = []
section_number = []
title_author = {}
title_era = {}
title_genre = {}
break_apart = False
for root, dirs, files in os.walk("corpus"):
for filename in files:
if filename[0] != ".":
f = open(root + "/" + filename, "r", encoding="utf8")
c = f.read()
f.close()
c = re.sub(r"\s+", "", c)
c = clean(c)
# Get metadata. the [:-4] removes the .txt from filename
metainfo = metadata[filename[:-4]]
title_author[metainfo[0]] = metainfo[1]
title_era[metainfo[0]] = metainfo[2]
title_genre[metainfo[0]] = metainfo[3]
if not break_apart:
info_list.append(c)
title_list.append(metainfo[0])
author_list.append(metainfo[1])
era_list.append(metainfo[2])
genre_list.append(metainfo[3])
else:
broken_sections = textBreak(c)
info_list.extend(broken_sections)
title_list.extend([metainfo[0] for i in
range(0,len(broken_sections))])
author_list.extend([metainfo[1] for i in
range(0,len(broken_sections))])
era_list.extend([metainfo[2] for i in
range(0,len(broken_sections))])
genre_list.extend([metainfo[3] for i in
range(0,len(broken_sections))])
section_number.extend([i for i in range(0, len(broken_sections))])
In [14]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
In [15]:
vectorizer = CountVectorizer(analyzer="word", ngram_range=(1,1),
token_pattern=r"\S+", max_features=100)
In [16]:
vectorizer = CountVectorizer(analyzer="char",ngram_range=(1,1),
max_features = 100)
In [17]:
word_count_matrix = vectorizer.fit_transform(info_list)
# This will tell you the features found by the vectorizer.
vocab = vectorizer.get_feature_names()
print(vocab)
In [18]:
import pandas as pd
from pandas import Series
fullcorpus = ""
for text in info_list:
fullcorpus += text
tokens = list(fullcorpus)
corpus_series = Series(tokens)
values = corpus_series.value_counts()
print(values[:10])
In [19]:
# Just use this instead when creating your vectorizer.
# To get raw term frequencies (TF), set use_idf to False; otherwise, set it to True
vectorizer = TfidfVectorizer(use_idf=False, analyzer="char",
ngram_range=(1,1), max_features=10)
If you are using texts of different lengths, you will need some sort of normalization if you hope to use Euclidean distance as a similarity measure. One of the easier ways to normalize is to adjust the raw character counts to occurrences per thousand characters. The code below does this using pandas.
In [20]:
from pandas import DataFrame
# Recreate a CountVectorizer object
vectorizer = CountVectorizer(analyzer="char", ngram_range=(1,1),
max_features=100)
word_count_matrix=vectorizer.fit_transform(info_list)
vocab = vectorizer.get_feature_names()
# We will need a dense matrix, not a sparse matrix
dense_words = word_count_matrix.toarray()
corpus_dataframe = DataFrame(dense_words, columns=vocab)
# Calculate how long each document is
doclengths = corpus_dataframe.sum(axis=1)
# Make a series that is the same length as the document length series
# but populated with 1000.
thousand = Series([1000 for i in range(0,len(doclengths))])
# Divide this by the length of each document
adjusteddoclengths = thousand.divide(doclengths)
# Multiply the corpus DataFrame by this adjusting factor
per_thousand = corpus_dataframe.multiply(adjusteddoclengths, axis = 0)
print(per_thousand)
# Convert back to a plain array for word_count_matrix;
# .values replaces the deprecated DataFrame.as_matrix()
word_count_matrix = per_thousand.values
In [21]:
my_vocab = ["的", "之", "曰", "说"]
vectorizer = CountVectorizer(analyzer="char",ngram_range=(1,1),
vocabulary = my_vocab)
Now we can start calculating the relationships among these works. We will have to decide whether to use Euclidean distance or cosine similarity. We will import several tools to help us do this.
Euclidean distance: each vector is understood as a point in space, and we calculate the distance between each pair of points. These distances are then used to judge similarity.
Cosine similarity: here we are interested in the direction each vector points, and we calculate the angle between each pair of vectors.
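To see why the choice matters, consider a toy example (unrelated to the workshop corpus) in which one count vector is simply a scaled-up copy of the other, as happens when one text is much longer than another. The Euclidean distance between them is large, while the cosine similarity is a perfect 1.0 because they point in the same direction.

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances, cosine_similarity

# b is ten times a: same proportions, very different magnitudes.
a = np.array([[1, 2, 3]])
b = np.array([[10, 20, 30]])
print(euclidean_distances(a, b))  # roughly 33.7: far apart as points
print(cosine_similarity(a, b))    # 1.0: identical direction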
In [22]:
from sklearn.metrics.pairwise import euclidean_distances, cosine_similarity
euc_or_cosine = "euc"
if euc_or_cosine == "euc":
similarity = euclidean_distances(word_count_matrix)
elif euc_or_cosine == "cos":
similarity = cosine_similarity(word_count_matrix)
The similarity variable now contains the distance or similarity measure between each pair of documents in the corpus. You can use this to create a linkage matrix, which will allow you to visualize the relationships among these texts as a dendrogram. Here we will use the Ward algorithm to cluster the texts together.
In [23]:
from scipy.cluster.hierarchy import ward, dendrogram
linkage_matrix = ward(similarity)
In [24]:
# import the plotting library.
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.font_manager
# Set the font to a Chinese Font Family
# STHeiti works for Macs, SimHei should work on Windows
# Many Linux distributions do not ship a compatible Chinese font by default.
# Here I have defaulted to a Japanese font.
# I've added logic that checks what system you are using.
from sys import platform
if platform == "linux" or platform == "linux2":
print("Sorry, I can't see the appropriate fonts, defaulting to Japanese")
matplotlib.rc('font', family="TakaoPGothic")
elif platform == "win32" or platform == "win64":
matplotlib.rc('font', family="SimHei")
elif platform == "darwin":
matplotlib.rc('font', family='STHeiti')
# Make the Dendrogram
dendrogram(linkage_matrix, labels=title_list)
plt.show()
In [25]:
dendrogram(linkage_matrix, labels=title_list)
# Add a Title
plt.title("Textual Relationships")
# Add x and y axis labels
plt.xlabel("Texts")
plt.ylabel("Distance")
# Set the angle of the labels so they are easier to read
plt.xticks(rotation=60)
# Show the plot
plt.show()
In [26]:
# Set the size of the Figure
# This will make a Seven inch by Seven inch figure
plt.figure(figsize=(7,7))
dendrogram(linkage_matrix, labels=title_list)
# Add a Title
plt.title("Textual Relationships")
# Add x and y axis labels
plt.xlabel("Texts")
plt.ylabel("Distance")
# Set the angle of the labels so they are easier to read
plt.xticks(rotation=60)
plt.savefig("results.pdf")
In [27]:
dendrogram(linkage_matrix, labels=title_list)
plt.title("Textual Relationships")
plt.xlabel("Texts")
plt.ylabel("Distance")
plt.xticks(rotation=60)
# Create a dictionary for color selection
# Here we are using genre as the basis for color
# You would need to change this if you wanted to color based on authorship.
color_dict = {"传奇":"red", "小说":"blue", "话本":"magenta"}
# Return information about the tick labels
plt_info = plt.gca()
tick_labels = plt_info.get_xmajorticklabels()
# Iterate through each tick label and assign a new color
for tick_label in tick_labels:
# Get the genre from the title to genre dictionary
genre = title_genre[tick_label.get_text()]
# Get the color from the dictionary
color = color_dict[genre]
# Set the color
tick_label.set_color(color)
# Show the plot
plt.show()
There are other ways to visualize the relationships among these texts. Principal component analysis is a way to explore the variance within the dataset. We can use much of the same data that we used for hierarchical cluster analysis.
You will need to import a few new modules
In [28]:
from sklearn.decomposition import PCA
PCA decomposes the dataset into abstracted components that describe the variance. These can be used as axes on which to replot the data. This will often allow you to get the best view of the data (or at least the most comprehensive).
Generally you will only need the first two principal components (which describe the most variance within the dataset). Sometimes you will be interested in the third and fourth components; for now, just the first two will be fine.
In [29]:
# Create the PCA object
pca = PCA(n_components = 2)
# PCA requires a dense matrix. word_count_matrix is sparse
# unless you ran the normalization to per 1000 code above!
# Convert it to dense matrix
#dense_words = word_count_matrix.toarray()
dense_words = word_count_matrix
# Analyze the dataset
my_pca = pca.fit(dense_words).transform(dense_words)
In [30]:
import numpy as np
# The input here will be the information you want to use to color
# the graph.
def info_for_graph(input_list):
# This will return the unique values.
# [a, a, a, b, b] would become
# {a, b}
unique_values = set(input_list)
# create a list of numerical labels and a dictionary
# mapping each unique value to its numerical label
unique_labels = [i for i in range(0, len(unique_values))]
unique_dictionary = dict(zip(unique_values, unique_labels))
# make class list
class_list = []
for item in input_list:
class_list.append(unique_dictionary[item])
return unique_labels, np.array(class_list), unique_values
In [31]:
unique_labels, info_labels, unique_genres = info_for_graph(genre_list)
# Make a color list, the same length as unique labels
colors = ["red", "magenta", "blue"]
# Make the figure
plt.figure()
# Plot the points using color information.
# This code is partially adapted from brandonrose.org/clustering
for color, each_class, label in zip(colors, unique_labels, unique_genres):
plt.scatter(my_pca[info_labels == each_class, 0],
my_pca[info_labels == each_class, 1],
label = label, color = color)
# Title the plot and label your axes
plt.title("Principal Component Analysis")
plt.xlabel("PC1: " + "{0:.2f}".format(pca.explained_variance_ratio_[0] * 100)+"%")
plt.ylabel("PC2: " + "{0:.2f}".format(pca.explained_variance_ratio_[1] * 100)+"%")
# Give it a legend
plt.legend()
plt.show()
In [32]:
unique_labels, info_labels, unique_genres = info_for_graph(genre_list)
colors = ["red", "magenta", "blue"]
plt.figure()
for color, each_class, label in zip(colors, unique_labels, unique_genres):
plt.scatter(my_pca[info_labels == each_class, 0],
my_pca[info_labels == each_class, 1],
label = label, color = color)
for i, text_label in enumerate(title_list):
plt.annotate(text_label, xy = (my_pca[i, 0], my_pca[i, 1]),
xytext=(my_pca[i, 0], my_pca[i, 1]),
size=8)
plt.title("Principal Component Analysis")
plt.xlabel("PC1: " + "{0:.2f}".format(pca.explained_variance_ratio_[0] * 100)+"%")
plt.ylabel("PC2: " + "{0:.2f}".format(pca.explained_variance_ratio_[1] * 100)+"%")
plt.legend()
plt.show()
In [33]:
loadings = pca.components_
# This will plot the locations of the loadings, but make the
# points completely transparent.
plt.scatter(loadings[0], loadings[1], alpha=0)
# Label and Title
plt.title("Principal Component Loadings")
plt.xlabel("PC1: " + "{0:.2f}".format(pca.explained_variance_ratio_[0] * 100)+"%")
plt.ylabel("PC2: " + "{0:.2f}".format(pca.explained_variance_ratio_[1] * 100)+"%")
# Iterate through the vocab and plot where each feature falls on the loadings graph.
# Note that the loadings array is indexed (component, feature), the opposite
# of the (sample, component) layout of the PCA scores above.
for i, txt in enumerate(vocab):
plt.annotate(txt, (loadings[0, i], loadings[1, i]), horizontalalignment='center',
verticalalignment='center', size=8)
plt.show()